library(cleaningtools)
library(dplyr)
my_raw_dataset <- cleaningtools::cleaningtools_raw_data
my_kobo_survey <- cleaningtools::cleaningtools_survey
my_kobo_choice <- cleaningtools::cleaningtools_choices01 - R framework with IMPACT - session 1
IMPACT R framework
The IMPACT R framework has been developed to meet the requirements of the IMPACT research cycle.
The ecosystem is a modular framework with two dimensions:
- a horizontal dimension that focuses on the outcome of a given step, and
- a vertical dimension that focuses on the content of a given step.
The framework is built around:
- 4 steps:
- Cleaning: any manipulation to go from the raw data to the clean data
- Composition: any manipulation before the analysis e.g. adding indicators, adding information from loop or main, aok aggregation, etc.
- Analysis: any manipulation regarding only the analysis
- Outputs: any manipulation to format the outputs.
- 4 verbs:
- Add: functions that will add a variable (column) to the dataset.
- Check: functions that will flag values based on a specific check. It will return them in a log. A check_* will return a list: the checked dataset, and the log. The function takes a dataset as input and returns the dataset + the new indicator (and any intermediate steps used for the calculation).
- Create: functions that will create, transform something, e.g. creating a cleaning log with the checks to be filled, create analysis results table, create an output. Outputs from create_* functions outputs can be in different shape, format, etc.
- Review: functions that will review an object by comparing it to standards or another object and flags differences, e.g. reviewing the cleaning by comparing the raw dataset, the clean dataset and the cleaning log, analysis comparing it with another analysis.
- 2 adjectives:
- Pipe-able: In the framework, functions of the same family should be pipe-able. In the following case, 2 check_* functions are piped.
- Independent: At any given step, the user can change tool. Each input and each output of a step should follow the same format and definition.
These elements will help to improve cooperation and collaboration between different teams while allowing modularity to adapt to each context and assessment.
Cleaning - Checking a dataset
The following section will present some introduction about the composition.
check_outliers
my_log1 <- my_raw_dataset %>%
check_outliers(uuid_column = "X_uuid")In this example, there are:
checked_dataset: the raw dataset (with extra variables if needed)potential_outliers: a log of potential outliers
typeof(my_log1)[1] "list"
my_log1 %>%
names()[1] "checked_dataset" "potential_outliers"
The log has at least 4 columns:
uuid: the unique identifierissue: the issue being flaggedquestion: the name of the questionold_value: the value being flagged
my_log1$potential_outliers %>%
head()| uuid | issue | question | old_value |
|---|---|---|---|
| b5b1d37a-e27a-4c35-a0f5-2cde9d6dfd06 | outlier (normal distribution) | age_respondent_r | 86 |
| 956b5ed0-5a62-41b7-aec3-af93fbc5b494 | outlier (normal distribution) | age_respondent_r | 84 |
| 97ad6294-30c6-454e-a0b3-42126415b767 | outlier (log distribution) | age_respondent_r | 18 |
| e005e719-57c4-44a3-ac2f-5d6d1ff68831 | outlier (log distribution) | age_respondent_r | 18 |
| c9aaa542-118f-4e42-93de-fb0916572541 | outlier (normal distribution) | num_hh_member | 19 |
| 48e8896b-d1be-4600-8839-2d8b994ebcfb | outlier (normal distribution) | num_hh_member | 19 |
Outliers are defined as +/- 3 standard deviation from the mean.
For log outliers, log(x + 1) is used.
check_duplicate
my_log2 <- my_raw_dataset %>%
check_duplicate(uuid_column = "X_uuid")
my_log2$duplicate_log %>%
head()| uuid | old_value | question | issue |
|---|
There is no duplicate. The log is empty.
Pipe-able
my_log3 <- my_raw_dataset %>%
check_outliers(uuid_column = "X_uuid") %>%
check_duplicate(uuid_column = "X_uuid")names(my_log3)[1] "checked_dataset" "potential_outliers" "duplicate_log"
my_log3$potential_outliers %>%
head()| uuid | issue | question | old_value |
|---|---|---|---|
| b5b1d37a-e27a-4c35-a0f5-2cde9d6dfd06 | outlier (normal distribution) | age_respondent_r | 86 |
| 956b5ed0-5a62-41b7-aec3-af93fbc5b494 | outlier (normal distribution) | age_respondent_r | 84 |
| 97ad6294-30c6-454e-a0b3-42126415b767 | outlier (log distribution) | age_respondent_r | 18 |
| e005e719-57c4-44a3-ac2f-5d6d1ff68831 | outlier (log distribution) | age_respondent_r | 18 |
| c9aaa542-118f-4e42-93de-fb0916572541 | outlier (normal distribution) | num_hh_member | 19 |
| 48e8896b-d1be-4600-8839-2d8b994ebcfb | outlier (normal distribution) | num_hh_member | 19 |
my_log3$duplicate_log %>%
head()| uuid | old_value | question | issue |
|---|
More checks
This an example of more checks that exist.
more_logs <- my_raw_dataset %>%
check_duplicate(uuid_column = "X_uuid") %>%
check_soft_duplicates(uuid_column = "X_uuid", kobo_survey = my_kobo_survey, sm_separator = ".") %>%
check_outliers(uuid_column = "X_uuid") %>%
check_value(uuid_column = "X_uuid") add_duration
more_logs$checked_dataset <- more_logs$checked_dataset %>%
add_duration(uuid_column = "X_uuid", start_column = "X.U.FEFF.start", end_column = "end")
more_logs$checked_dataset[1:6, c("start_date", "start_time", "end_date", "end_time", "days_diff", "duration")]| start_date | start_time | end_date | end_time | days_diff | duration |
|---|---|---|---|---|---|
| 2021-07-05 | 658.57 mins | 2021-07-05 | 696.68 mins | 0 days | 38.11 |
| 2021-07-05 | 608.90 mins | 2021-07-05 | 641.92 mins | 0 days | 33.02 |
| 2021-07-05 | 682.23 mins | 2021-07-05 | 726.43 mins | 0 days | 44.20 |
| 2021-07-04 | 1342.98 mins | 2021-07-04 | 1380.15 mins | 0 days | 37.17 |
| 2021-07-04 | 1391.62 mins | 2021-07-05 | 18.88 mins | 1 days | 67.26 |
| 2021-07-05 | 617.38 mins | 2021-07-05 | 756.52 mins | 0 days | 139.14 |
The duration is added to the checked_dataset in the list, not in the my_raw_dataset dataframe. The check_* functions are used in a pipe, so it needs the current dataset to be modified.
At the moment, add_duration takes very specific format. It will change in the future to become more robust and using lubridate.
check_duration can now be used with the previous checks.
more_logs <- more_logs %>%
check_duration(column_to_check = "duration", uuid_column = "X_uuid")As much as possible, check_* functions take default argument or the functions will be able to guess some information, e.g. the check_outliers function guesses some numerical values. Some functions need more information.
other/text columns
check_other needs the list of columns to be checked. It currently, it cannot detect the open text question. KOBO tool can be used.
other_columns_to_check <- my_kobo_survey %>%
dplyr::filter(type == "text") %>%
dplyr::filter(name %in% names(my_raw_dataset)) %>%
dplyr::pull(name)
more_logs <- more_logs %>%
check_others(uuid_column = "X_uuid", columns_to_check = other_columns_to_check) check_logical_with_list
In other cases, the check is specific and should be tailored to the dataset, for example, check_logical_with_list. All the logical checks can be recorded in an excel file.
logical_check_list <- readxl::read_excel("inputs/01 - example - check_list.xlsx")
logical_check_list| check_id | description | check_to_perform | columns_to_clean |
|---|---|---|---|
| check_1 | primary_livelihood is employment but expenses less than 200000 | primary_livelihood.employment == 1 & tot_expenses < 200000 | primary_livelihood.employment, |
| tot_expenses | |||
| check_2 | acces water and tank emptied | access_water_enough == “totally_insufficient” & tank_emptied == “about_half” | access_water_enough, tank_emptied |
The check list has 4 columns:
- check_id : the name of the check
- description: the description of the check
- check_to_perform: the check to perform. The format for the check to be performed should take the format based on tidyverse. That format is as if a new indicator is create with a mutate. That new indicator should be a logical (i.e. TRUE or FALSE) with TRUE being the value to flag.
This list can then be used with check_logical_with_list.
example_logic <- my_raw_dataset %>%
check_logical_with_list(uuid_column = "X_uuid",
list_of_check = logical_check_list,
check_id_column = "check_id",
check_to_perform_column = "check_to_perform",
columns_to_clean_column = "columns_to_clean",
description_column = "description")example_logic$logical_all %>%
head()| uuid | question | old_value | issue | check_id | check_binding |
|---|---|---|---|---|---|
| f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb | primary_livelihood.employment | TRUE | primary_livelihood is employment but expenses less than 200000 | check_1 | check_1 / f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb |
| f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb | tot_expenses | 125000 | primary_livelihood is employment but expenses less than 200000 | check_1 | check_1 / f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb |
| e9f8b44c-c507-45a1-8d76-66d886437b8f | primary_livelihood.employment | TRUE | primary_livelihood is employment but expenses less than 200000 | check_1 | check_1 / e9f8b44c-c507-45a1-8d76-66d886437b8f |
| e9f8b44c-c507-45a1-8d76-66d886437b8f | tot_expenses | 175000 | primary_livelihood is employment but expenses less than 200000 | check_1 | check_1 / e9f8b44c-c507-45a1-8d76-66d886437b8f |
| 994a60b8-e640-425c-9774-160651d7af04 | primary_livelihood.employment | TRUE | primary_livelihood is employment but expenses less than 200000 | check_1 | check_1 / 994a60b8-e640-425c-9774-160651d7af04 |
| 994a60b8-e640-425c-9774-160651d7af04 | tot_expenses | 175000 | primary_livelihood is employment but expenses less than 200000 | check_1 | check_1 / 994a60b8-e640-425c-9774-160651d7af04 |
The log returns :
- uuid
- question: for all variables in columns_to_clean
- old value: for all variables in columns_to_clean
- issue
- check_id: logical check identifier
- check_binding: the combination of the check_id and the uuid.
One check can be flagged in several rows, in the example above, for each uuid, the primary_livelihood and tot_expenses are flagged.
Format for the check_to_perform should take the format based on tidyverse. That format is as if a new indicator is create with a mutate. That new indicator should be a logical (i.e. TRUE or FALSE) with TRUE being the value to flag.
my_raw_dataset %>%
dplyr::mutate(xxx = primary_livelihood.employment == 1 & tot_expenses < 200000) %>%
dplyr::select(X_uuid, xxx, primary_livelihood.employment, tot_expenses) %>%
head()| X_uuid | xxx | primary_livelihood.employment | tot_expenses |
|---|---|---|---|
| dcf2753a-6ea2-40f5-b493-3527931ef96c | FALSE | FALSE | 250000 |
| 8790ce5c-1c35-41a2-b3c0-538f937d5397 | FALSE | TRUE | 750000 |
| bb818e04-9c40-408e-919f-6b40ff1fdbb3 | FALSE | FALSE | 250000 |
| 28b90cbb-2cf0-41c5-9ee1-1c719c0d4c02 | FALSE | TRUE | 600000 |
| 7f2a0c6a-529b-481f-963f-a96dca2ec034 | FALSE | TRUE | 500000 |
| b4f92064-12ea-4970-b0f5-fd309de1dda3 | FALSE | FALSE | 650000 |
The checked dataset will be return with extra columns, i.e. a logical variable with the name of the check_id.
example_logic$checked_dataset[1:6,tail(names(example_logic$checked_dataset))]| X_notes | X_status | X_submitted_by | X_index | check_1 | check_2 |
|---|---|---|---|---|---|
| [] | submitted_via_web | reach_irq | 1 | FALSE | FALSE |
| [] | submitted_via_web | reach_irq | 2 | FALSE | FALSE |
| [] | submitted_via_web | reach_irq | 3 | FALSE | FALSE |
| [] | submitted_via_web | reach_irq | 4 | FALSE | FALSE |
| [] | submitted_via_web | reach_irq | 5 | FALSE | FALSE |
| [] | submitted_via_web | reach_irq | 6 | FALSE | FALSE |
If you don’t include columns_to_clean the check_logical_with_list function will try to guess the variables. Not guarantee it will read or pick the correct names.
Exercises
Exercise 1
Try the following with a dataset:
- Perform a check to spot personal identifiable information
library(cleaningtools)
library(dplyr)
my_raw_dataset <- cleaningtools::cleaningtools_raw_data
my_kobo_survey <- cleaningtools::cleaningtools_survey
my_kobo_choice <- cleaningtools::cleaningtools_choicesExercise 2
- Perform a check that will look at the percentages of missing value per observation and that will spot any observation that is different.
Exercise 3
- Fill the excel checklist to do the following checks:
- household number (variable: num_hh_member) is above 8.
- the water source for drinking water is bottled (variable: water_source_drinking, value: bottled) and the household always treat the drinking water (variable: treat_drink_water, value: always_treat).
- the water source for drinking water is bottled (variable: water_source_drinking, value: bottled) and one of the main reason for the the household to not meet its water needs is the water pressure (variable: access_water_enough_why_not, value: water_pressure, this is a select multiple)
| name | label::English (en) | type |
|---|---|---|
| num_hh_member | How many members are there in your household (including you)? | integer |
| water_source_drinking | What is the main source of water used by your household for drinking? | select_one water_sources |
| treat_drink_water | Does your household treat this water in any way to make it safer to drink? | select_one treat |
| access_water_enough_why_not | What are the main reasons your household is not able to meet its water needs? | select_multiple barriers_water_needs |
| name | label::English (en) | list_name |
|---|---|---|
| bottled | Bottled or sachet water | water_sources |
| always_treat | Yes, always treat it before drinking | treat |
| water_pressure | Water pressure is not high enough/pumps required | barriers_water_needs |
exercise_check_list <- readxl::read_excel("inputs/01 - example - check_list.xlsx")
my_raw_dataset %>%
check_logical_with_list(uuid_column = "X_uuid",
list_of_check = exercise_check_list,
check_id_column = "check_id",
check_to_perform_column = "check_to_perform",
columns_to_clean_column = "columns_to_clean",
description_column = "description")Extra
- Try to add duration with audit files.